Currently, our models for low prevalence species are unlikely to be very good because the mdoels are very overfitted. One way of dealing with the low abundance of certain species is by taking prevalence of a species across the entire dataset into account when deciding on the number of pseudoabsences to generate. To do this, while still accounting for sampling biases, we take the number of unique grid references that a species has been seen in and the number of unique grid references across the whole dataset. We then use the proportion of cells that a species of interest appears in to determine how many more pseudoabsences are generated. I.e. if a species is seen in 20% of cells, then we need to have 5x more pseudoabsences than presences. This should mean that the probablility of a species being present in any given location decreases, because they appear <50% of the time in the original dataset, and hopefully lead to more realistic models of presence. I am not sure how this works with random forest models because it is likely that they will then be trained to recognise more ‘absences’ than ‘presences’ which could lead to them only predicting absences.
In order to account for prevalence we’re going to need to reduce the data to only presence and absence points in each grid cell. At the moment, the data we are using has multiple sightings of species in the same grid cell if they are seen on different days of the year (we have only removed repeat sightings from the same day in the same grid cell). If we increase the number of pseudoabsences relative to the number of repeat sightings, we’re going to end up with a hyper-inflated number of absence points for certain species, e.g. species that are common in a very restricted area. If we do want to use prevalence and the repeat sightings data then there might be a way to scale numbers according to how many sightings have appeared in the dataset as well as by their prevalence.
So, I am going to run four logistic regression models for each species:
duplicated data with equal numbers of presence and absence points (‘original way’)
duplicated data with pseudoabsences scaled according to prevalence (‘original + prevalence’)
thinned data matched presence and absence (‘thinned + matched’)
thinned data with pseudoabsences scaled according to prevalence (‘thinned + prevalence’)
## [1] "sea" "broad_wood" "conif_wood"
## [4] "arable" "impr_grass" "neutr_grass"
## [7] "calc_grass" "acid_grass" "fen_marsh_swamp"
## [10] "heather" "heath_grass" "bog"
## [13] "inland_rock" "saltwater" "freshwater"
## [16] "sup_lit_rock" "sup_lit_sed" "lit_rock"
## [19] "lit_sed" "saltmarsh" "urban"
## [22] "suburban" "AnnualTemp" "MeanDiRange"
## [25] "Isotherm" "TempSeasonality" "MaxTempWarmestMonth"
## [28] "MinTempColdestMonth" "TempAnnualRange" "MeanTempWetQuarter"
## [31] "MeanTempDriestQuarter" "MeanTempWarmQuarter" "MeanTempColdQuarter"
## [34] "AnnualPrecip" "PrecipWetMonth" "PrecipDriestMonth"
## [37] "PrecipSeasonality" "PrecipWettestQuarter" "PrecipDriestQuarter"
## [40] "PrecipWarmQuarter" "PrecipColdQuarter" "elevation_UK"
## [43] "slope" "aspect"
Plot number of data points in the whole of the UK and for the region of interest that I’ve chosen. Can see that Tyria jacobaeae is the most recorded species at both extents.
## # A tibble: 6 x 25
## X1 X1_1 TO_ID TO_STARTDATE TO_ENDDATE DT_ID TO_GRIDREF TO_LOCALITY GR_ID
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 165 1410 13/07/2005 13/07/2005 1 NM430268 Isle of Mu~ 1
## 2 2 166 1411 13/07/2005 13/07/2005 1 NM430268 Isle of Mu~ 1
## 3 3 1466 6557 02/07/2005 02/07/2005 1 SK322596 Ash Brook,~ 1
## 4 4 1937 10384 16/07/2002 16/07/2002 1 SV918113 St Mary's ~ 1
## 5 5 2393 13520 21/07/2004 21/07/2004 1 SK440660 Holmewood ~ 1
## 6 6 2394 13521 21/07/2004 21/07/2004 1 SK440660 Holmewood ~ 1
## # ... with 16 more variables: VC_ID <dbl>, SQ_10 <chr>, CONCEPT <chr>,
## # TO_PRECISION <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>, lon <dbl>, lat <dbl>,
## # date <date>, yr <dbl>, jd <dbl>, sp_n <chr>, com_n <chr>, ig <chr>,
## # ag <chr>, id <chr>
## [1] FALSE
Alter the records of presences to get only unique locations of any given species in the data.
thinned_occ <- ndf %>% dplyr::select(lon, lat, sp_n, TO_GRIDREF) %>%
group_by(sp_n, TO_GRIDREF) %>%
distinct() %>%
mutate(species = sp_n,
year = 2015) # need a year column for Rob's function to work, just a dummy year
Calculate prevalence of each species to use in selecting the number of pseudoabsences. First, get number of unique grid cells for each species and the number of records for each species. Second, get the prevalence by dividing the number of unique cells the each species appears in by the number of unique cells across all species (i.e. the total number of cells sampled). Third, remove all duplicate values so have one species per row. Fourth, calculate the number of pseudo absences needed for the duplicated data and the thinned data by dividing the number of cells/records by the percentage prevalence (to get the value of 1%) n_records / prev * 100 and multiplying it by the percentage of cells that the species wasn’t in (1-prev) * 100.
# length(unique(sp_y$TO_GRIDREF))
sp_y_prev <- sp_y %>% group_by(sp_n) %>%
mutate(n_cells = length(unique(TO_GRIDREF)), # number unique locations
n_records = length(TO_GRIDREF)) %>% # total number of records
ungroup() %>%
mutate(prev = n_cells/length(unique(TO_GRIDREF))) %>% # get the prevalence
dplyr::select(sp_n, n_cells, n_records, prev) %>% # select only columns of interest
distinct() %>% # one row per species
mutate(dup_rec_PA = round(n_records/(prev*100) *((1-prev)*100), 0), # calculate PA from duplicated records
thin_rec_PA = round(n_cells/(prev*100) *((1-prev)*100), 0)) # calculate PA from single records
sp_y_prev$dup_rec_PA <- ifelse(sp_y_prev$dup_rec_PA>6664, 6664, sp_y_prev$dup_rec_PA) # some PAs are to high for duplicated records
head(arrange(sp_y_prev, -n_cells))
## # A tibble: 6 x 6
## sp_n n_cells n_records prev dup_rec_PA thin_rec_PA
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Tyria jacobaeae 1605 6664 0.403 6664 2376
## 2 Odezia atrata 580 983 0.146 5764 3401
## 3 Zygaena lonicerae 572 825 0.144 4917 3409
## 4 Chiasmia clathrata 572 2067 0.144 6664 3409
## 5 Zygaena filipendulae 501 901 0.126 6258 3480
## 6 Ematurga atomaria 394 662 0.0990 6027 3587
The code for getting pseudoabsences from duplicated data while accounting for prevalence generated an error when the number of pseudoabsences was too high for certain species. Therefore, I just reduced the number of pseudoabsences to the maximum number of records which was for Tyria jacobaeae at 6664. This is a quick and dirty way of doing it but as I outlined above, this method doesn’t really make much sense in the grand scheme of things.
I’m going to get the absences/pseudoabsences on all of the species but then run the logistic regression models on only a subset of species because the code seems to crash R all the time.
Now need to generate the four different pseudoabsence data frames and run the logistic regression models. Going to run the models immeadiately after the pseudoabsence generation. To speed things up, I am only going to run the models for a subset of 6 species (Orgyia antiqua, Perizoma albulata, Tyria jacobaeae, Euclidia mi, Zygaena filipendulae, Chiasmia clathrata). However, the pseudoabsences and prevalence are calculated using all species.
Full dataset (with duplicated records) and matched number of pseudoabsences.
#### Create presence/absence
spp <- unique(sp_y$species) ## all species
spp_sub <- c('Orgyia antiqua', 'Perizoma albulata', 'Tyria jacobaeae', 'Euclidia mi', 'Zygaena filipendulae', 'Chiasmia clathrata')
# spp_sub %in% spp
registerDoParallel(cores = detectCores() - 1)
system.time(
ab1 <- foreach(i = 1 : length(spp)) %dopar% {
cpa(spdat = ndf, species = spp[i], matchPres = TRUE,
minYear = 2000, maxYear = 2017, recThresh = 5)
}
)
## user system elapsed
## 0.13 0.12 0.82
registerDoSEQ()
names(ab1) <- spp
# ab1 <- lapply(spp, FUN = function(x) cpa(spdat = ndf, species = x, matchPres = TRUE,
# minYear = 2000, maxYear = 2017, recThresh = 5))
Takes ~2 mins to run with only 6 species.
# do all species in parallel to save time
registerDoParallel(cores = detectCores() - 1)
system.time(
lr_org <- foreach(s = 1:length(spp_sub), #.combine='comb', .multicombine=TRUE,
# .init=list(list()),
.packages = c('tidyverse', 'raster', 'readr', 'dismo'),
.errorhandling = 'pass') %dopar% {
# print(paste(s, spp[s], sep = " "))
if(is.null(ab1[s])){
# print(paste("No data for species", spp[s]))
next
}
sdm_lr <- fsdm(species = spp_sub[s], model = "lr",
climDat = ht, spData = ab1, knots = -1,
k = 5,
prediction = T,
inters = F,
write = F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
# save(se_out, file = paste0("C:/Users/thoval/Documents/Analyses/lr_outs/", spp[s], "_lr_se_error_prediction.rdata"))
list(sdm_lr, se_out)
}
)
## user system elapsed
## 2.60 1.93 137.62
registerDoSEQ()
# lapply(spp_sub, FUN = function(x) {
#
# sdm_lr <- fsdm(species = x, model = "lr",
# climDat = ht, spData = ab1, knots = -1,
# k = 5,
# prediction = T,
# inters = F,
# write = F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
# se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
# list(sdm_lr, se_out)
# }
# )
The output of this function is a bit odd as it has a list within another list. The first list is each separate species, and the second list is the output from the fsdm function [[1]] and the standard error [[2]]. Obviously, with models that do not have a standard error prediction there will be no entry in the second list.
The number of pseudoabsences are determined by multiplying the number of records by 1-prevalence*100 of a species, as described above.
Generated pseudoabsences with the data containing duplicates of the same species found in the same cell over time doesn’t make sense because for some species will get a hyper-inflated number of PA points (e.g. species that are quite common in a very restricted area). This takes too long and crashes the computer so I’m going to leave it for now but the code is ready. At some point I still intend to run it to see what difference it makes to the SDMs though.
Thinned data and equal number of presences and absences.
registerDoParallel(cores = detectCores() - 1)
system.time(
ab1_thin_match <- foreach(i = 1 : length(spp)) %dopar% {
cpa(spdat = thinned_occ, species = spp[i], matchPres = TRUE,
minYear = 2000, maxYear = 2017, recThresh = 5)
}
)
## user system elapsed
## 0.08 0.01 0.36
registerDoSEQ()
names(ab1_thin_match) <- spp
# do all species in parallel to save time
registerDoParallel(cores = detectCores() - 1)
system.time(
lr_thin_match <- foreach(s = 1:length(spp_sub), #.combine='comb', .multicombine=TRUE,
# .init=list(list()),
.packages = c('tidyverse', 'raster', 'readr', 'dismo'),
.errorhandling = 'pass') %dopar% {
# print(paste(s, spp[s], sep = " "))
if(is.null(ab1_thin_match[s])){
# print(paste("No data for species", spp[s]))
next
}
sdm_lr <- fsdm(species = spp_sub[s], model = "lr",
climDat = ht, spData = ab1_thin_match, knots = -1,
k = 5,
prediction = T,
inters = F,
write = F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
# save(se_out, file = paste0("C:/Users/thoval/Documents/Analyses/lr_outs/", spp[s], "_lr_se_error_prediction.rdata"))
list(sdm_lr, se_out)
}
)
## user system elapsed
## 2.22 1.79 146.96
registerDoSEQ()
Thinned data and pseudoabsences are determined by multiplying the number of records by (1-prevalence)*100 of a species.
registerDoParallel(cores = detectCores() - 1)
system.time(
ab1_thin_prev <- foreach(i = 1 : length(spp)) %dopar% {
cpa(spdat = thinned_occ, species = spp[i], matchPres = FALSE,
nAbs = sp_y_prev$thin_rec_PA[sp_y_prev$sp_n == spp[i]],
minYear = 2000, maxYear = 2017, recThresh = 5)
}
)
## user system elapsed
## 0.10 0.03 0.47
registerDoSEQ()
names(ab1_thin_prev) <- spp
# do all species in parallel to save time
registerDoParallel(cores = detectCores() - 1)
system.time(
lr_thin_prev <- foreach(s = 1:length(spp_sub), #.combine='comb', .multicombine=TRUE,
# .init=list(list()),
.packages = c('tidyverse', 'raster', 'readr', 'dismo'),
.errorhandling = 'pass') %dopar% {
# print(paste(s, spp[s], sep = " "))
if(is.null(ab1_thin_prev[s])){
# print(paste("No data for species", spp[s]))
next
}
sdm_lr <- fsdm(species = spp_sub[s], model = "lr",
climDat = ht, spData = ab1_thin_prev, knots = -1,
k = 5,
prediction = T,
inters = F,
write = F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
# save(se_out, file = paste0("C:/Users/thoval/Documents/Analyses/lr_outs/", spp[s], "_lr_se_error_prediction.rdata"))
list(sdm_lr, se_out)
}
)
## user system elapsed
## 4.15 4.55 172.70
registerDoSEQ()
plot the results for each of the species in the dataset.
As you can see, thinning the data to only include presemce/absence points does affect the distributions (plots in the first column vs those in the second). Although the amount that it affects them and the direction it affects them in seems to differ between species (even with this very small subset of species). Thinning the data prior to running the model and choosing pseudoabsences based on prevalence has a drastic impact on the probability of presence. Increasing the number of pseudoabsences directly reduces the probability of occurrence. I think this is because it makes the model ‘expect’ the species to be less prevalent across the whole area of interest. This makes sense for some of the rarest species, as they are not expected to occur in 50% of cells. For some species, e.g. Perizoma albulata, this seems to make the model think that the species is unlikely to be present in any of the cells, even though it clearly is based on the raw data.
par(mfrow = c(1,3))
for(i in 1:length(lr_org)){
plot(lr_org[[i]][[1]]$Predictions, main = paste(lr_org[[i]][[1]]$Species, 'Original way',sep="\n"))
points(x = lr_org[[i]][[1]]$Data$lon[lr_org[[i]][[1]]$Data$val == 1],
lr_org[[i]][[1]]$Data$lat[lr_org[[i]][[1]]$Data$val == 1],
cex = 0.6, col = "red", pch = 20)
plot(lr_thin_match[[i]][[1]]$Predictions, main = paste(lr_thin_match[[i]][[1]]$Species, 'Thinned + matched',sep="\n"))
points(x = lr_org[[i]][[1]]$Data$lon[lr_org[[i]][[1]]$Data$val == 1],
lr_org[[i]][[1]]$Data$lat[lr_org[[i]][[1]]$Data$val == 1],
cex = 0.6, col = "red", pch = 20)
plot(lr_thin_prev[[i]][[1]]$Predictions, main = paste(lr_thin_prev[[i]][[1]]$Species, 'Thinned + prevalence',sep="\n"))
# points(x = lr_org[[i]][[1]]$Data$lon[lr_org[[i]][[1]]$Data$val == 1],
# lr_org[[i]][[1]]$Data$lat[lr_org[[i]][[1]]$Data$val == 1],
# cex = 0.6, col = "red", pch = 20)
}
par(mfrow = c(1,1))